T5 (Text-to-Text Transfer Transformer) is a transformer-based language model developed by researchers at Google AI. Introduced in a paper first posted to arXiv in October 2019 and published in the Journal of Machine Learning Research (JMLR) in 2020, T5 proposed a unified framework in which every natural language processing (NLP) task is cast as a text-to-text problem. Classification, translation, summarization, question answering, and even regression tasks are all reformulated so that both the input and the output are text strings. This simple but powerful idea allowed the same model architecture, training procedure, loss function, and hyperparameters to be applied across dozens of different tasks without modification.
The paper, titled "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," was authored by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Beyond introducing the T5 model itself, the paper presented an extensive empirical study comparing pre-training objectives, architectures, unlabeled datasets, transfer learning approaches, and scaling strategies. This systematic investigation made the paper one of the most cited works in the field, and its conclusions shaped subsequent research on pre-training and fine-tuning language models.
Google open-sourced all T5 model checkpoints and code under the Apache 2.0 license, making the models freely available for research and commercial use. The original implementation was built on Mesh TensorFlow, with a later JAX/Flax-based reimplementation released as the T5x framework. T5 helped seed the modern open-source Hugging Face ecosystem, where T5 and its derivatives remain among the most downloaded models, and its design choices (text-to-text framing, span corruption, instruction tuning) continue to shape contemporary language models in 2026.
By 2019, transfer learning had become the dominant paradigm in NLP. Models like ELMo, GPT, and BERT demonstrated that pre-training on large amounts of unlabeled text, then fine-tuning on downstream tasks, consistently outperformed training from scratch. However, these models used different architectures and different task-specific output heads. BERT used an encoder-only architecture and required adding a classification layer for each task. GPT used a decoder-only architecture and generated text autoregressively. Both approaches worked well for certain categories of tasks but required custom modifications depending on the downstream application.
The T5 authors recognized that this diversity of approaches made it difficult to compare transfer learning methods fairly. Different papers used different architectures, objectives, datasets, and evaluation protocols, making it hard to isolate which factor was responsible for observed improvements. T5 was designed as a common framework that could serve as both a practical model and a controlled experimental testbed.
The key insight was that virtually any NLP task can be framed as taking text as input and producing text as output. For a sentiment classification task, the model receives the text "classify: This movie is great" and outputs the text "positive." For English-to-German translation, it receives "translate English to German: That is good" and outputs "Das ist gut." For summarization, it receives a document prefixed with "summarize:" and outputs a shorter version. Even regression tasks can be handled by having the model output a string representation of a number. This unified text-to-text format eliminates the need for task-specific architectures or output heads.
The authors framed the project as a survey-via-experiment, reimplementing leading approaches inside a single codebase to allow apples-to-apples ablation of choices that had previously been bundled together.
T5 follows the original encoder-decoder transformer architecture proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need," with several modifications. The encoder processes the input text bidirectionally, and the decoder generates the output text autoregressively, attending to both the encoder's output and the previously generated tokens. This makes T5 a true seq2seq model in the lineage of statistical and neural machine translation systems, retargeted as a general NLP backbone.
T5 differs from the original transformer in three main ways:
Relative positional embeddings: Instead of the sinusoidal absolute positional encodings used in the original transformer, T5 uses a learned relative position bias. Each possible offset between two token positions is assigned a scalar bias that is added to the attention logits before the softmax. The model uses 32 learned embedding buckets with ranges that increase logarithmically up to an offset of 128 positions, beyond which all positions map to the same bucket. This scheme allows the model to generalize to sequence lengths not seen during training. Position embedding parameters are shared across all layers, though each attention head within a layer learns its own set of biases.
Pre-layer normalization: T5 places layer normalization before each sub-layer (attention or feed-forward) rather than after, a configuration known as "pre-norm." This improves training stability. T5 also uses a simplified "RMSNorm" variant of layer normalization that only rescales activations without recentering them (no bias term and no subtraction of the mean).
No bias terms: T5 removes bias terms from all dense layers and layer normalization throughout the model. This is a minor simplification that reduces parameter count slightly without harming performance.
The activation function in the feed-forward sub-layers is ReLU in the original T5; T5 v1.1 switched to GeLU/GeGLU for improved performance. The encoder uses fully bidirectional self-attention. The decoder uses causal self-attention plus cross-attention layers that read the full encoder output. This split is why T5 is sometimes described as inheriting the strengths of both BERT (bidirectional understanding) and GPT (autoregressive generation) in a single network.
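These modifications are small enough to sketch directly. The PyTorch snippet below is modeled on the published descriptions of the relative position buckets and the RMSNorm-style layer norm (bucket count, maximum distance, and epsilon are the defaults named above); it is an illustrative sketch, not the reference implementation.

```python
import math
import torch
import torch.nn as nn

def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map a relative offset (key position minus query position) to one of
    num_buckets ids: half the buckets for offsets before vs. after the query,
    exact buckets for small offsets, logarithmically spaced buckets up to
    max_distance, and one shared bucket for anything farther away."""
    num_buckets //= 2                              # split between "before" and "after"
    bucket = num_buckets if relative_position > 0 else 0
    rel = abs(relative_position)
    max_exact = num_buckets // 2
    if rel < max_exact:
        return bucket + rel                        # small offsets get their own bucket
    log_bucket = max_exact + int(
        math.log(rel / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return bucket + min(log_bucket, num_buckets - 1)

class T5LayerNorm(nn.Module):
    """RMS-style layer norm: rescale by the root mean square of the
    activations with a learned gain, no bias, and no mean subtraction."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)  # mean of squares only
        return self.weight * x * torch.rsqrt(variance + self.eps)
```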
T5 uses SentencePiece tokenization with a unigram subword model. The vocabulary contains 32,128 tokens, including 100 sentinel tokens used during span corruption pre-training. SentencePiece operates directly on raw text and does not require pre-tokenization, so T5 inputs need no separate word-splitting step. The 100 reserved sentinel IDs (denoted <extra_id_0> through <extra_id_99> in the Hugging Face implementation) appear both in the input, where they replace masked spans, and in the target, where they delimit the spans the decoder must generate.
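The sentinels can be inspected directly with the released tokenizer; the sketch below uses the Hugging Face API and the public google-t5/t5-small checkpoint.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google-t5/t5-small")
ids = tok("The <extra_id_0> fox jumps over the <extra_id_1> dog").input_ids
print(tok.convert_ids_to_tokens(ids))
# Sentinels are kept as single special tokens; they occupy the highest ids of
# the tokenizer vocabulary, below the model's padded embedding size of 32,128.
print(tok.convert_tokens_to_ids("<extra_id_0>"), tok.convert_tokens_to_ids("<extra_id_99>"))
```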
The original T5 paper released five model sizes, from 60 million to 11 billion parameters. All variants use the same vocabulary and the same encoder-decoder structure; they differ only in depth, width, and the number of attention heads. The encoder and decoder share the same depth and hidden dimension in all variants.
| Variant | Parameters | Encoder layers | Decoder layers | d_model | d_ff | Attention heads | d_kv |
|---|---|---|---|---|---|---|---|
| T5-Small | 60M | 6 | 6 | 512 | 2,048 | 8 | 64 |
| T5-Base | 220M | 12 | 12 | 768 | 3,072 | 12 | 64 |
| T5-Large | 770M | 24 | 24 | 1,024 | 4,096 | 16 | 64 |
| T5-3B | 3B | 24 | 24 | 1,024 | 16,384 | 32 | 128 |
| T5-11B | 11B | 24 | 24 | 1,024 | 65,536 | 128 | 128 |
In this table, d_model is the hidden dimension, d_ff is the inner feed-forward dimension, and d_kv is the dimension of each attention head's key and value projections. T5-3B and T5-11B achieve their larger sizes primarily by widening the feed-forward layers and increasing attention heads, rather than adding more layers. Wide FFNs were considered cheaper to parallelize across TPU cores than very deep stacks, since each additional layer adds latency in training and inference.
The embedding matrix was shared between the encoder, the decoder, and the output classifier in the original T5 release, so a single 32,128 by d_model parameter table served all three roles. The tie between the embeddings and the output classifier was later removed in T5 v1.1.
To pre-train T5, the authors created a new dataset called the Colossal Clean Crawled Corpus (C4). C4 was derived from the April 2019 snapshot of Common Crawl, a publicly available archive of web pages. The raw Common Crawl data contained roughly 1.4 trillion tokens, but the vast majority of this text was low-quality, duplicated, or not natural English.
The cleaning pipeline applied the following filters:

- Only lines ending in a terminal punctuation mark (a period, exclamation mark, question mark, or closing quotation mark) were retained.
- Pages with fewer than five sentences were discarded, and only lines containing at least three words were kept.
- Pages containing any word on a public list of obscene or offensive words were removed.
- Lines containing the word "Javascript" (typically boilerplate warnings) were removed, as were pages containing the placeholder phrase "lorem ipsum" or a curly bracket "{" (a signal of source code).
- The corpus was deduplicated by discarding all but one occurrence of any three-sentence span.
- The langdetect tool was used to keep only pages classified as English with a probability of at least 0.99.
After this cleaning process, C4 contained approximately 750 gigabytes of "reasonably clean and natural English text," roughly two orders of magnitude larger than Wikipedia. The dataset was made publicly available to the research community, and it has since been widely used for pre-training other language models. A later audit by Dodge et al. (2021) documented the corpus's composition, which spans roughly 365 million documents drawn from a long tail of internet domains.
The Dodge audit also found systematic biases: the offensive-content blocklist disproportionately removed pages from minority dialect communities, and a substantial fraction of the corpus came from a handful of large patent and news websites. These findings informed the design of later web-scale corpora such as The Pile, RedPajama, and FineWeb.
A companion multilingual dataset, mC4, was released alongside mT5. mC4 applies the same heuristics to Common Crawl data in 101 languages identified by Google's cld3 detector, totaling roughly 6.3 trillion tokens.
T5 uses a denoising pre-training objective called span corruption (sometimes called "span masking" or "fill-in-the-blank"). During pre-training, 15% of the tokens in each input sequence are selected for corruption. Rather than masking individual tokens independently (as in BERT's masked language modeling), T5 groups consecutive corrupted tokens into spans. Each span is replaced with a single unique sentinel token (<extra_id_0>, <extra_id_1>, and so on). The target output consists of the original tokens from each corrupted span, delimited by the corresponding sentinel tokens.
For example, if the input is "The quick brown fox jumps over the lazy dog" and the tokens "quick brown" and "lazy" are selected for corruption, the model's input becomes:
The <extra_id_0> fox jumps over the <extra_id_1> dog
And the target output is:
<extra_id_0> quick brown <extra_id_1> lazy <extra_id_2>
The trailing <extra_id_2> marks the end of the last span. The average span length is 3 tokens. The authors found that span corruption outperformed other objectives they tested, including standard language modeling, BERT-style per-token masking, and deshuffling. The 15% corruption rate was chosen based on BERT precedent; rates of 10%, 15%, and 25% all yielded similar results.
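Given chosen spans, building the input/target pair is mechanical. The toy sketch below operates on whitespace-separated words rather than SentencePiece ids and takes the spans as given rather than sampling them, but it reproduces the example above exactly.

```python
def span_corrupt(tokens, spans):
    """Build the (input, target) pair for T5 span corruption, given a token
    list and a list of (start, end) index pairs to mask. A toy sketch: the
    real pipeline samples the spans and works on SentencePiece ids."""
    inputs, targets = [], []
    prev_end = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[prev_end:start] + [sentinel]   # keep context, drop the span
        targets += [sentinel] + tokens[start:end]       # emit the span after its sentinel
        prev_end = end
    inputs += tokens[prev_end:]
    targets += [f"<extra_id_{len(spans)}>"]             # closing sentinel
    return " ".join(inputs), " ".join(targets)

tokens = "The quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(tokens, [(1, 3), (7, 8)])       # mask "quick brown" and "lazy"
# inp == "The <extra_id_0> fox jumps over the <extra_id_1> dog"
# tgt == "<extra_id_0> quick brown <extra_id_1> lazy <extra_id_2>"
```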
The table below contrasts span corruption with other denoising objectives evaluated in the T5 paper. The first three are the main pre-training objectives compared in the systematic study; the last two are auxiliary baselines.
| Objective | Input | Target |
|---|---|---|
| Causal language modeling (GPT-style) | (empty; the model predicts every next token) | "The quick brown fox jumps over the lazy dog" |
| BERT-style MLM (per-token) | "The [MASK] [MASK] fox jumps over the [MASK] dog" | "The quick brown fox jumps over the lazy dog" |
| Span corruption (T5) | "The <X> fox jumps over the <Y> dog" | "<X> quick brown <Y> lazy <Z>" |
| Deshuffling | "jumps fox brown the over quick lazy dog the" | "The quick brown fox jumps over the lazy dog" |
| Prefix language modeling | "The quick brown fox" | "jumps over the lazy dog" |
An important advantage of span corruption is that the target sequences are much shorter than the full input, since only the corrupted tokens need to be predicted. This makes pre-training more computationally efficient than objectives that require the model to reconstruct the entire input. Because the decoder never has to regenerate the unmasked context, each gradient step does less work for the same conceptual task.
Later and contemporaneous work explored span masking in several directions: UL2's mixture-of-denoisers expanded the family to include very long spans and prefix language modeling, LongT5 used PEGASUS-style gap sentence generation, and SpanBERT (which slightly predates T5) applied span masking to an encoder-only model. The general lesson is that masking contiguous regions of meaning gives a richer training signal than masking individual subword tokens chosen at random.
The baseline configuration used throughout the systematic study had a maximum sequence length of 512 tokens for both the encoder input and decoder target. Sequences were packed so that multiple shorter examples could be combined into a single 512-token sequence, improving training efficiency. The batch size was 128 sequences, so each batch contained approximately 65,536 tokens (2^16).
These baseline models were trained for 2^19 steps (524,288 steps), exposing them to roughly 34 billion tokens; since C4 contains far more than 34 billion tokens, no training data was repeated. The final released T5 checkpoints were pre-trained longer, for about one million steps with a much larger batch, corresponding to roughly one trillion tokens. Even that is a modest budget by the standards of 2026, when models like Llama 3 train on 15 trillion tokens; T5-11B's strong results are a useful reminder that raw token count is not the only determinant of capability.
The learning rate followed an inverse square root schedule: lr = 1/sqrt(max(n, k)), where n is the current training step and k = 10,000 is a warmup constant. For the first 10,000 steps, the learning rate is held at the constant value 1/sqrt(10,000) = 0.01, and after that it decays as 1/sqrt(n). Training used the AdaFactor optimizer rather than Adam, as AdaFactor uses less memory by maintaining moving averages of row and column sums of the squared gradients rather than the full matrix. AdaFactor was developed by Noam Shazeer (one of the T5 authors) and has since become the default optimizer for very large transformer pre-training runs, including PaLM and many of Google's later models.
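The schedule is simple enough to state in a few lines; the sketch below implements the formula above (step counts assumed 1-indexed).

```python
def t5_learning_rate(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square root schedule: constant at 1/sqrt(warmup_steps) for the
    first warmup_steps steps, then decaying as 1/sqrt(step)."""
    return 1.0 / max(step, warmup_steps) ** 0.5

print(t5_learning_rate(1))        # 0.01 during warmup
print(t5_learning_rate(250_000))  # 0.002 late in pre-training
```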
The largest T5-11B model was trained on TPU v3 pods, with up to 1,024 chips arranged in a 2D torus. All models were implemented using the Mesh TensorFlow library, which supports efficient model parallelism across multiple TPU chips by partitioning tensors along named axes ("mesh dimensions") rather than working at the device level. This abstraction made it possible to express T5-11B's parameter sharding in a single declarative configuration file.
Fine-tuning used batches of 2^16 tokens (with the same packed-sequence trick) and a constant learning rate of 0.001 in most experiments. Each downstream task was fine-tuned for a fixed number of steps (2^18 in the baseline setup) with periodic checkpointing, and checkpoint selection was based on validation performance.
A defining feature of the T5 paper is its comprehensive empirical study. Rather than simply proposing a new model, the authors systematically evaluated how different design choices affect transfer learning performance. They used the T5-Base model (220M parameters) for most experiments, training each variant on C4 and evaluating on a suite of benchmarks including GLUE, SuperGLUE, CNN/Daily Mail summarization, SQuAD, and WMT translation.
The study compared the following factors:
| Factor | Variations tested |
|---|---|
| Architecture | Encoder-decoder, decoder-only, prefix language model |
| Pre-training objective | Language modeling, span corruption, BERT-style MLM, deshuffling |
| Corruption rate | 10%, 15%, 25%, 50% |
| Corruption span length | Individual tokens (i.i.d.), mean span length 2, 3, 5, 10 |
| Dataset | C4, unfiltered Common Crawl, Wikipedia + BooksCorpus, WebText-like |
| Dataset size | Full C4, 2^29 tokens, 2^27 tokens, 2^25 tokens, 2^23 tokens |
| Fine-tuning strategy | Full model fine-tuning, adapter layers, gradual unfreezing |
| Multi-task learning | Pre-train then fine-tune, multi-task pre-training, leave-one-out multi-task |
| Scaling | More parameters vs. more training steps vs. ensembling |
The study produced several influential conclusions:
Encoder-decoder models outperform decoder-only models when both have the same total number of parameters. The encoder-decoder architecture provides a natural separation between understanding the input and generating the output, which benefits seq2seq tasks like translation and summarization. Importantly, this comparison was at the parameter count of T5-Base (~220M); the conclusion does not necessarily extrapolate to the very largest scales, where decoder-only models like GPT-3 and PaLM later achieved strong results.
Span corruption outperforms other pre-training objectives. Denoising objectives (predicting corrupted tokens) consistently beat autoregressive language modeling for downstream task performance, while also being more computationally efficient because the decoder targets are shorter. Among the settings tested (i.i.d. token masking and mean span lengths of 2, 3, 5, and 10), a mean span length of 3 performed best, and the authors adopted it as the default.
Pre-training on in-domain data helps, but dataset size matters more. Models pre-trained on a small domain-specific dataset (like Wikipedia) could outperform models pre-trained on a larger but noisier dataset, but the best results came from a large, cleaned dataset (C4). Repeating a small dataset for many epochs degraded downstream performance.
Pre-training then fine-tuning outperforms multi-task learning for most tasks, although multi-task pre-training followed by fine-tuning can match or exceed pure pre-training in some settings.
Scaling up model size, training data, and training time all improve performance. In the paper's experiments, spending additional compute on a larger model tended to help more than training a smaller model for longer, and ensembling multiple models also provided significant gains. The authors did not propose a formal scaling law (Kaplan et al.'s scaling laws appeared a few months later, and the Chinchilla paper in 2022), but the qualitative trends were consistent with later findings.
Fine-tuning all model parameters works best. While adapter layers (which freeze the pre-trained weights and only train small additional layers) are more parameter-efficient, full fine-tuning consistently produced better results. This conclusion was later qualified by the PEFT and LoRA literature, which showed that small, well-designed adapters can match full fine-tuning at a fraction of the parameter cost.
These findings guided the design of the final T5 models and influenced subsequent work on scaling language models. The empirical methodology, where each factor is varied while holding everything else constant on a standardized base configuration, has been widely emulated in scaling studies and instruction-tuning ablations.
The largest T5-11B model achieved state-of-the-art results on a wide range of NLP benchmarks when it was released.
T5-11B set new state-of-the-art scores on the GLUE benchmark (a collection of nine sentence-level classification tasks) and the SuperGLUE benchmark (a harder successor to GLUE with eight tasks). On SuperGLUE, T5-11B achieved an average score of 88.9, approaching the human baseline of 89.8. This performance on SuperGLUE was a significant milestone, as the benchmark had been designed specifically to be challenging for contemporary models.
On the Stanford Question Answering Dataset (SQuAD), T5-11B achieved strong results by framing the reading comprehension task as text generation. Rather than predicting start and end positions in the passage (as BERT-style models do), T5 simply generates the answer text directly. In the paper's reported settings, T5-11B reached 91.26 exact match and 96.22 F1 on SQuAD v1.1.
On the CNN/Daily Mail summarization benchmark, T5-11B set a new state-of-the-art ROUGE score. T5 also did well on XSum (a more abstractive single-sentence-summary benchmark), although PEGASUS, designed specifically for summarization, eventually surpassed T5 at much smaller parameter counts.
T5-11B was evaluated on WMT 2014 English-German and English-French translation, outperforming prior pre-trained encoder-decoder baselines but not the dedicated translation systems trained on parallel data that won those competitions. Pre-training on monolingual English text limited T5's translation performance compared to systems with explicit parallel corpora.
In a follow-up study by Roberts et al. (2020), T5 was evaluated on closed-book question answering, where the model must answer factual questions without access to any external documents, relying entirely on knowledge stored in its parameters. T5-11B achieved 50.1% exact match accuracy on TriviaQA and 34.5% on Natural Questions in the closed-book setting, demonstrating that large language models can store and recall substantial amounts of world knowledge. The same paper introduced "salient span masking," an auxiliary objective that masks named entities and dates rather than random spans, which boosted closed-book QA performance by several percentage points.
The text-to-text framework is the conceptual core of T5. Each task is converted to a text-to-text format by prepending a task-specific text prefix to the input. The prefixes used during fine-tuning include:
| Task | Input prefix | Example input | Example output |
|---|---|---|---|
| English-German translation | "translate English to German:" | "translate English to German: That is good." | "Das ist gut." |
| English-French translation | "translate English to French:" | "translate English to French: That is good." | "C'est bien." |
| Sentiment classification | "sst2 sentence:" | "sst2 sentence: This movie is great." | "positive" |
| Sentence similarity | "stsb sentence1: ... sentence2:" | "stsb sentence1: The cat sat. sentence2: A cat is sitting." | "4.2" |
| Summarization | "summarize:" | "summarize: [long article text]" | "[short summary]" |
| Question answering | "question: ... context:" | "question: Who wrote Hamlet? context: ..." | "William Shakespeare" |
| Linguistic acceptability | "cola sentence:" | "cola sentence: The bird sang in." | "unacceptable" |
| Natural language inference | "mnli hypothesis: ... premise:" | "mnli hypothesis: A person is walking. premise: ..." | "entailment" |
| Coreference resolution | "wsc:" | "wsc: The trophy didn't fit in the suitcase because *it* was too big." | "trophy" |
This design has several practical advantages. A single model checkpoint can be fine-tuned (or even used zero-shot) on any task by simply changing the prefix. There is no need to design task-specific output heads or loss functions. The cross-entropy loss over generated tokens serves as the universal training objective.
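In practice, this means a single checkpoint and a single generation call serve every task. The sketch below uses the Hugging Face Transformers API with the public google-t5/t5-small checkpoint; the commented outputs are typical rather than guaranteed.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("google-t5/t5-small")
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

def run(prompt: str) -> str:
    """Encode a prefixed input, generate, and decode the text output."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32)
    return tok.decode(out[0], skip_special_tokens=True)

print(run("translate English to German: That is good."))  # e.g. "Das ist gut."
print(run("cola sentence: The bird sang in."))             # e.g. "unacceptable"
```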
A potential disadvantage is that for classification tasks, the model must generate an entire text label (e.g., "positive" or "entailment") token by token, which is less efficient than producing a single logit. In practice, the overhead is small because most label strings are just one or two tokens long.
For regression tasks like STS-B, where the target is a real number between 0 and 5, the authors rounded each score to the nearest increment of 0.2 and treated the resulting values as discrete string labels. The model emits a string like "4.2", and a small post-processing step converts it back to a float. This trick generalizes the text-to-text framing to numeric outputs without breaking the unified loss function.
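A minimal pair of helpers for this convention might look as follows; the rounding rule follows the paper, while the fallback for unparseable generations is an assumption.

```python
def stsb_target(score: float) -> str:
    """Round an STS-B similarity score to the nearest 0.2 and render it as
    the string the model is trained to emit."""
    return f"{round(score * 5) / 5:.1f}"

def stsb_parse(text: str) -> float:
    """Convert a generated string back to a float, falling back to 0.0 for
    unparseable outputs (one common convention, not the only one)."""
    try:
        return float(text.strip())
    except ValueError:
        return 0.0

print(stsb_target(4.27))   # "4.2"
print(stsb_parse("4.2"))   # 4.2
```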
T5 occupies a distinct position in the landscape of pre-trained language models. Whereas BERT uses only the encoder half of the transformer and GPT uses only the decoder half, T5 uses the full encoder-decoder architecture. This has implications for what types of tasks each model handles well.
| Feature | BERT | GPT | T5 |
|---|---|---|---|
| Architecture | Encoder-only | Decoder-only | Encoder-decoder |
| Attention pattern | Bidirectional (full) | Causal (left-to-right) | Bidirectional encoder, causal decoder |
| Pre-training objective | Masked language modeling | Autoregressive language modeling | Span corruption (denoising) |
| Task adaptation | Add task-specific output head | Prompt + generation | Text prefix + generation |
| Natural fit | Classification, token labeling, extraction | Text generation, dialogue | Seq2seq tasks: translation, summarization, QA |
| Output format | Class labels or span pointers | Free-form text | Free-form text |
| Example sizes | 110M (Base), 340M (Large) | 117M (GPT-1), 1.5B (GPT-2) | 60M (Small) to 11B (11B) |
| Inference cost | One forward pass | One pass per generated token | Encoder pass + one decoder pass per token |
| Training year | 2018 | 2018-2019 | 2019-2020 |
BERT's encoder-only architecture makes it strong for understanding tasks (classification, named entity recognition, extractive QA) but unable to generate free-form text. GPT's decoder-only architecture excels at text generation but processes input only from left to right, limiting its ability to fully attend to bidirectional context. T5's encoder-decoder architecture provides bidirectional encoding of the input and autoregressive generation of the output, making it naturally suited for tasks that require both understanding an input and producing a structured output.
The T5 paper's systematic comparison confirmed that encoder-decoder models tend to outperform decoder-only models of the same total parameter count on the benchmarks studied. However, as decoder-only models have continued to scale (with GPT-3 reaching 175 billion parameters and later models growing even larger), the decoder-only paradigm has become dominant for general-purpose language models, partly because it simplifies the training pipeline and scales more efficiently for pure generation tasks.
Inference patterns also differ. Decoder-only models process the prompt and generated continuation through one causal-attention stack, with KV caching amortizing prompt cost across all generated tokens. Encoder-decoder models run the prompt through the encoder once, then generate each output token in the decoder. For long generations on short prompts, decoder-only is more efficient. For short generations on long prompts (summarization, rerank scoring), encoder-decoder is competitive or faster.
T5's design has been extended, refined, and adapted in numerous follow-up models. The table below lists the most influential follow-ups; each is described in more detail in the subsections that follow.
| Model | Year | Authors | Key idea | Notable sizes |
|---|---|---|---|---|
| T5 v1.1 | 2020 | Google | GeGLU activation, untied output embeddings, no dropout during pre-training | Small to XXL (11B) |
| mT5 | 2020 | Xue et al. | Multilingual pre-training on 101 languages via mC4 | Small to XXL (13B) |
| ByT5 | 2021 | Xue et al. | Tokenizer-free, byte-level inputs | Small to XXL (12B) |
| Flan-T5 | 2022 | Chung et al. | Instruction tuning on a 1,836-task mixture | Small (80M) to XXL (11B) |
| LongT5 | 2022 | Guo et al. | Local + transient global attention for long inputs | Base, Large, XL (3B) |
| UL2 | 2022 | Tay et al. | Mixture-of-denoisers pre-training | UL2-20B |
| Switch Transformer | 2021 | Fedus et al. | Sparse mixture-of-experts on T5 backbone | Up to 1.6T params (Switch-C) |
| GLaM | 2021 | Du et al. | Decoder-only MoE inspired by T5/Switch lessons | Up to 1.2T params |
| CodeT5 | 2021 | Wang et al. (Salesforce) | Identifier-aware code pre-training | Small, Base, Large |
| CodeT5+ | 2023 | Wang et al. (Salesforce) | Flexible encoder-decoder/decoder-only modes | 220M to 16B |
| mT0 | 2022 | BigScience | Multilingual instruction tuning of mT5 on xP3 | Small to XXL (13B) |
| Chronos | 2024 | Ansari et al. (Amazon) | T5-based time-series forecasting foundation model | Tiny to Large |
Google released T5 v1.1 as an improved version of the original T5 with several changes: (1) the feed-forward activation function was changed from ReLU to GeGLU (a gated linear unit with GeLU activation), which improved quality; (2) dropout was disabled during pre-training; (3) the model was pre-trained on C4 only, without mixing in any downstream task data; and (4) the output classifier was no longer tied to the input embedding matrix. The model shapes also changed, most noticeably for the larger sizes: the v1.1 XL and XXL configurations use a much larger d_model with a smaller d_ff and fewer attention heads than the 3B and 11B configurations they replace. The naming convention shifted accordingly, with "3B" becoming "XL" and "11B" becoming "XXL." These changes produced better downstream performance at roughly the same model sizes.
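The gated feed-forward block can be sketched as follows (a PyTorch illustration with placeholder dimensions, not the released configuration):

```python
import torch
import torch.nn as nn

class GatedGeluFeedForward(nn.Module):
    """GeGLU feed-forward: two parallel input projections, where the
    GeLU-activated branch gates the linear branch before the output
    projection. All projections omit bias terms, as elsewhere in T5."""
    def __init__(self, d_model=768, d_ff=2048):
        super().__init__()
        self.wi_gate = nn.Linear(d_model, d_ff, bias=False)
        self.wi_lin = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        return self.wo(self.act(self.wi_gate(x)) * self.wi_lin(x))
```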
A practical consequence of dropping the supervised mixing during pre-training is that T5 v1.1 cannot be used out of the box on any downstream task; it must be fine-tuned first. The original T5 could be used zero-shot for any task whose prefix it had seen during the multi-task pre-training mix, but T5 v1.1 was pre-trained purely on the unsupervised span corruption objective. Most practitioners working with T5 today use the v1.1 weights as starting points for further fine-tuning or the Flan-T5 weights for zero-shot use.
Flan-T5 is an instruction-tuned version of T5 released by Google in October 2022 alongside the paper "Scaling Instruction-Finetuned Language Models" by Hyung Won Chung and colleagues. The model was initialized from a pre-trained T5 v1.1 checkpoint and then fine-tuned on a curated mixture of 1,836 tasks drawn from multiple sources, including Flan 2021, P3++, Super-Natural Instructions, and custom additions spanning question answering, natural language inference, code, dialogue, and chain-of-thought reasoning.
The instruction-tuning process used a mix of three prompting formats: zero-shot (instructions only), few-shot (instructions with examples), and chain-of-thought (instructions that ask for step-by-step reasoning). Training with all three formats yielded roughly 2% higher accuracy across all evaluation settings compared to training with any single format.
Flan-T5 comes in five sizes corresponding to the original T5 variants:
| Variant | Parameters |
|---|---|
| Flan-T5-Small | 80M |
| Flan-T5-Base | 250M |
| Flan-T5-Large | 780M |
| Flan-T5-XL | 3B |
| Flan-T5-XXL | 11B |
Flan-T5 outperformed the original T5 by 3% to 17% or more across various evaluation settings. It also demonstrated that instruction-tuned models converge faster and reach higher accuracy when further fine-tuned on individual downstream tasks, making them more computationally efficient starting checkpoints. Flan-T5 became one of the most widely used open-source language models for research and production applications, and it played an important role in popularizing instruction tuning as a training methodology. As of 2026, Flan-T5-XXL remains a common baseline in instruction-following research and a default starting point for academic teams that need a capable model under 15 billion parameters.
The broader Flan effort, described in Longpre et al.'s 2023 Flan Collection paper, formalized many ad-hoc data-mixing decisions for instruction-tuning sets. The methodology has been reused in many later open models, including Falcon-Instruct, MPT-Instruct, and several Llama variants.
mT5 (Multilingual T5) is a multilingual variant of T5 introduced by Xue et al. in 2020. It follows the same architecture and pre-training objective as T5 but was pre-trained on mC4, a multilingual version of C4 covering 101 languages. The mC4 corpus was created by applying the same cleaning heuristics used for C4 to Common Crawl data in each language, using the cld3 library for language identification. The resulting corpus contains approximately 6.3 trillion tokens across all languages.
Languages were sampled during pre-training using a temperature-weighted scheme that boosts low-resource languages relative to their natural frequency, similar to the approach used in mBERT and XLM-R. The vocabulary was expanded to 250,112 SentencePiece tokens (compared to T5's 32,128) to give adequate coverage to the 101 languages.
mT5 was released in five sizes mirroring T5 (Small through XXL, with mT5-XXL at roughly 13 billion parameters owing to the larger vocabulary) and achieved state-of-the-art results on several multilingual benchmarks, including XTREME and XNLI. The authors also described a technique to prevent "accidental translation" in zero-shot cross-lingual transfer, where a generative model might produce output in the wrong language; the fix is to mix a small amount of the unsupervised multilingual span corruption task into fine-tuning.
ByT5 (Byte-level T5), introduced by Xue et al. in 2021, operates directly on raw UTF-8 byte sequences rather than using a subword tokenizer. This eliminates the need for a tokenizer entirely and makes the model robust to misspellings, character-level noise, and morphologically rich languages where subword tokenization can be suboptimal. ByT5 was particularly effective on tasks involving word-internal phenomena such as spelling correction, pronunciation prediction, and morphological analysis. The trade-off is that byte sequences are typically longer than subword token sequences, increasing computational cost.
To offset the longer input length, ByT5 redistributes capacity in favor of the encoder: the encoder is roughly three times deeper than the decoder, on the theory that byte-level inputs need extra encoding work to assemble word-like representations, while the decoder can stay relatively shallow once it has those representations to attend to. Pre-training masks spans of approximately 20 bytes (rather than 3 SentencePiece tokens), reflecting the fact that bytes carry less information per unit than subword tokens. ByT5 is competitive with parameter-matched mT5 across many tasks and clearly better on noisy or morphologically rich inputs.
LongT5 (Guo et al., 2022) extends T5 to long input sequences. Standard transformer attention has quadratic time and memory complexity with sequence length, which makes long documents expensive to process. LongT5 uses local attention (a sliding window) in the encoder, reducing complexity to linear, optionally augmented with "transient global" tokens that summarize each block. LongT5 was pre-trained using a PEGASUS-style gap sentence generation objective and demonstrated strong performance on long-document summarization tasks such as arXiv, PubMed, and BigPatent. Sizes range from Base (220M) to XL (3B).
CodeT5 (Wang et al., Salesforce, EMNLP 2021) adapted the T5 architecture for programming tasks. It was pre-trained on CodeSearchNet covering Ruby, JavaScript, Go, Python, Java, and PHP. The distinctive feature is an identifier-aware pre-training objective: the model learns to identify which tokens are identifiers (variable, function, and class names) and to recover masked identifiers. CodeT5 also uses a bimodal dual-generation objective pairing source code with natural-language docstrings.
CodeT5 supports code summarization, generation, translation between languages, and defect detection. CodeT5+ (2023) extended this with models ranging from 220 million to 16 billion parameters and a flexible architecture that can operate in encoder-only, decoder-only, or encoder-decoder mode depending on the task.
UL2 (Unifying Language Learning Paradigms), introduced by Yi Tay, Mostafa Dehghani, and colleagues at Google in 2022, built directly on the T5 architecture but proposed a fundamentally different pre-training objective. Rather than using a single denoising task, UL2 uses a Mixture-of-Denoisers (MoD) approach that combines three types of denoising:

- R-denoising (regular): standard T5-style span corruption with short spans and a low corruption rate.
- S-denoising (sequential): prefix language modeling, in which the model sees the beginning of a sequence and must generate the rest.
- X-denoising (extreme): aggressive corruption using very long spans or very high corruption rates.
UL2 also introduced the concept of "mode switching," where special tokens in the input signal which denoising mode was used, allowing the model to adapt its behavior at inference time. The UL2-20B model outperformed both T5 and GPT-style models of comparable size across 50 supervised NLP benchmarks, beat 175B GPT-3 on zero-shot SuperGLUE, and tripled the one-shot summarization performance of T5-XXL. The mixture-of-denoisers idea later influenced PaLM 2's pre-training mixture and is widely cited as one of the inspirations for modern pre-training recipes that interleave multiple objectives.
Switch Transformer (Fedus, Zoph, and Shazeer, 2021) applied sparse mixture-of-experts (MoE) to a T5 backbone. Each MoE layer replaces the dense FFN with a router that sends each token to exactly one of K experts (the "switch" simplification of top-k routing). Switch Transformers maintained T5-level quality with 4-7x faster pre-training, and the largest variant Switch-C reached 1.571 trillion parameters, one of the first publicly described trillion-parameter models. GLaM (Du et al., 2021) brought MoE to a decoder-only model. Both works seeded the architectures used in Mixtral, DeepSeek-MoE, and other later sparse models.
mT0 is a family of multilingual instruction-tuned models developed by the BigScience workshop, built by fine-tuning mT5 on xP3 (Crosslingual Public Pool of Prompts), a collection of prompts and tasks drawn from 46 languages and 16 NLP tasks. The result is an mT5 derivative capable of zero-shot instruction following in dozens of languages. mT0 sizes range from Small (300M) through XXL (13B). The companion BLOOMZ models apply the same xP3 recipe to BLOOM, showing that the multilingual instruction tuning recipe generalizes across architectures.
Chronos (Ansari et al., Amazon, 2024) repurposes the T5 architecture for time-series forecasting. Time-series values are scaled and quantized into a small vocabulary of bin tokens (4,096 in Chronos-T5, far smaller than T5's 32,128 word-piece vocabulary), and the model predicts future tokens autoregressively from a window of past tokens. Pre-training on a broad collection of public time-series datasets gives Chronos competitive zero-shot performance on unseen series. Chronos-Bolt, a wider, shallower follow-up trained on roughly 100 billion observations, is available through Amazon SageMaker JumpStart and Amazon Bedrock Marketplace; Chronos checkpoints have been downloaded from Hugging Face well over 100 million times.
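As a toy illustration of the recipe (not the reference implementation; the mean-scaling, clipping range, and uniform binning here are simplifying assumptions), time-series values can be mapped to token ids as follows:

```python
import numpy as np

def series_to_tokens(series, num_bins=4096, clip=15.0):
    """Mean-scale a series, clip it to [-clip, clip], and map each value to
    one of num_bins uniformly spaced bins that serve as vocabulary tokens."""
    series = np.asarray(series, dtype=float)
    scale = float(np.mean(np.abs(series))) or 1.0
    scaled = np.clip(series / scale, -clip, clip)
    edges = np.linspace(-clip, clip, num_bins - 1)   # interior bin boundaries
    return np.digitize(scaled, edges), scale

tokens, scale = series_to_tokens([10.0, 12.0, 13.5, 11.0])
# `tokens` are ids in [0, num_bins - 1]; forecasting decodes future ids and
# multiplies the corresponding bin centers back by `scale`.
```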
T5x is a research framework developed by Google for training, evaluating, and serving sequence models. Built on JAX and Flax, T5x provides the reference implementation for T5 and its variants and replaces the older Mesh TensorFlow code. The framework was described in the March 2022 paper "Scaling Up Models and Data with t5x and seqio" (Roberts, Chung, Levskaya, et al.). It uses XLA's GSPMD partitioner via jax.pjit to express data, model, and activation parallelism uniformly, and pairs with the SeqIO library for reproducible data pipelines. T5x has been used internally at Google to train PaLM, mT5, UL2, Switch Transformers, and many other large models.
T5 and its variants have been applied across a broad range of NLP tasks in both research and industry.
T5's encoder-decoder architecture makes it a natural fit for abstractive summarization. T5-11B set state-of-the-art results on CNN/Daily Mail, and Flan-T5 has been widely adopted for summarization in production systems where a fully decoder-only LLM would be overkill or too expensive. LongT5 extended this capability to inputs of up to tens of thousands of tokens.
Although T5 was primarily designed for English, the text-to-text format handles translation naturally. The mT5 variant extended this to 101 languages and has been used as a backbone for low-resource translation research.
T5 can perform both extractive and generative question answering. In the closed-book setting, the model generates answers purely from its parametric knowledge, as Roberts et al. (2020) showed. T5 has also been used as the generator in retrieval-augmented systems such as Fusion-in-Decoder (FiD) by Izacard and Grave (2021), where multiple retrieved passages are independently encoded and the decoder attends to all of them jointly.
By framing classification as text generation (outputting a label string), T5 can handle any classification task without a task-specific head: sentiment analysis, topic categorization, natural language inference, and content moderation. T5 has been particularly successful as a search reranker. monoT5 (Nogueira et al., 2020) reformulated passage reranking as the binary text-to-text task "Query: ... Document: ... Relevant:" with target "true" or "false," and achieved strong results on MS MARCO and the TREC Deep Learning passage tracks. T5-based cross-encoder rerankers remain a common second-stage component in retrieve-and-rerank pipelines, although newer reranker architectures based on encoder-only or LLM backbones now lead the leaderboards in 2026.
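A sketch of monoT5-style scoring through the Hugging Face API is shown below; the castorini/monot5-base-msmarco checkpoint name and the exact token handling follow the commonly used community release and are assumptions rather than verbatim reference code.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco").eval()

def relevance(query: str, passage: str) -> float:
    """Score a query-passage pair as P("true") among {"true", "false"} for the
    first decoded token, following the monoT5 reranking recipe."""
    enc = tok(f"Query: {query} Document: {passage} Relevant:",
              return_tensors="pt", truncation=True, max_length=512)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**enc, decoder_input_ids=start).logits[0, -1]
    true_id = tok.convert_tokens_to_ids("▁true")
    false_id = tok.convert_tokens_to_ids("▁false")
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()

print(relevance("who wrote hamlet", "Hamlet is a tragedy written by William Shakespeare."))
```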
The CodeT5 family demonstrated that the T5 architecture transfers well to programming tasks: summarization, generation from natural language descriptions, translation between languages, and bug detection. CodeT5+ extended this with up to 16B parameters and a flexible architecture that can run in encoder-only, decoder-only, or encoder-decoder mode.
T5 has been applied to structured data-to-text tasks such as generating natural language from tables, knowledge graphs, or database records, evaluated on WebNLG, ToTTo, and DART.
Fine-tuned T5 models have been used for named entity recognition, relation extraction, and event extraction by generating a structured text output (often a JSON-like string of entities and relations).
The Chronos series shows that the T5 backbone can be repurposed for forecasting numerical time series by quantizing values into a small vocabulary. Other groups have used T5 for music tokenization, protein engineering (ProtT5 by Elnaggar et al., 2021), and chemical reaction prediction.
T5's impact extends beyond its own model family. The text-to-text framing demonstrated that casting all tasks as text generation is both practical and effective. This idea influenced later models, including GPT-3 (which uses text-to-text prompting for few-shot learning) and instruction-tuned models like InstructGPT and ChatGPT. The unified loss function and shared output head also simplified multi-task evaluation.
The T5 paper set a standard for rigorous empirical comparison of design choices. Its methodology of isolating individual factors has been adopted by later large-scale studies including Kaplan et al., the Chinchilla scaling laws paper, and the Llama technical reports. The Flan-T5 line of work was instrumental in showing that instruction tuning on a diverse mixture of tasks dramatically improves instruction following, directly influencing InstructGPT, ChatGPT, and many later open instruction-tuning efforts.
T5's finding that encoder-decoder models outperform decoder-only models at the same parameter count sparked ongoing discussion about optimal architectures. While decoder-only models have become dominant at the largest scales, encoder-decoder models like T5 remain competitive for cross-encoder retrieval, translation, and other tasks where input comprehension and output generation are distinct steps. Google's PaLM 2 reportedly drew on UL2's mixture-of-denoisers, an objective that grew out of the encoder-decoder lineage.
Google's decision to release T5 checkpoints, code, and training data freely helped build the open-source language model ecosystem. T5 models are among the most downloaded on Hugging Face, serving as the foundation for hundreds of fine-tuned models. The C4 dataset became one of the most widely reused pre-training corpora and inspired follow-ups such as RedPajama, The Pile, and FineWeb.
T5 has several known limitations. The original T5 models were trained exclusively on English text. While mT5 addresses multilingual coverage, the English-only C4 dataset limits the original model's applicability to other languages.
T5's maximum sequence length of 512 tokens limits its ability to process long documents without truncation or chunking. LongT5 was developed to address this. By the standards of 2026, when million-token context windows are common in frontier decoder-only models, T5's 512-token horizon is restrictive for many applications.
The T5-11B model requires substantial compute for both training and inference. Fine-tuning requires multiple GPUs or TPUs, and serving at scale is expensive. Inference is also slowed by the two-stage encoder-then-decoder pipeline, which is less efficient than a single decoder-only stack for long generations. For simple classification, generating a text label token by token adds latency compared to encoder-only models like BERT.
The C4 dataset, while cleaned, still contains biases, inaccuracies, and potentially harmful content. The Dodge et al. (2021) audit found that C4 underrepresents content from minority groups and certain regions, and the offensive-content blocklist removed proportionally more text from minority dialects.
T5 has been largely supplanted by decoder-only LLMs for general-purpose generation. As of 2026, the most widely used open-weight models are decoder-only (Llama 3, Mistral, Qwen, DeepSeek), and frontier closed models such as GPT-4 and Claude are generally understood to be decoder-only as well. T5 remains popular as a research baseline, as a starting point for academic instruction tuning, and as a backbone for cross-encoder rerankers and time-series foundation models, but it is no longer the architecture of choice for new general-purpose LLM training runs.
All original T5 model checkpoints are available through the Hugging Face Transformers library under identifiers such as google-t5/t5-small, google-t5/t5-base, google-t5/t5-large, google-t5/t5-3b, and google-t5/t5-11b. Flan-T5 checkpoints are at google/flan-t5-small through google/flan-t5-xxl, mT5 at google/mt5-small through google/mt5-xxl, T5 v1.1 at google/t5-v1_1-*, ByT5 at google/byt5-*, and LongT5 at google/long-t5-*.
The original Mesh TensorFlow implementation lives at google-research/text-to-text-transfer-transformer; the newer T5x JAX/Flax implementation is at google-research/t5x. C4 is available through TensorFlow Datasets and as allenai/c4 on Hugging Face Datasets, with the multilingual mC4 variant at mc4. All original T5 models are released under Apache 2.0; later derivatives mostly follow the same licensing model, with exceptions noted on individual model cards.