Parti (text-to-image model)
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,246 words
Parti (Pathways Autoregressive Text-to-Image) is a text-to-image generation model developed by researchers at Google. Described in the paper "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation," which was posted to arXiv on 22 June 2022, Parti generates images from natural-language descriptions by treating the task as a sequence-to-sequence problem rather than as a denoising process. The work is best known for two contributions: a demonstration that autoregressive image generation continues to improve as the model is scaled to 20 billion parameters, and PartiPrompts, a benchmark of more than 1,600 English prompts designed to stress-test text-to-image systems. Parti was released as a research result, with code, model weights, and training data deliberately withheld pending stronger safeguards.[1][2][3]
Parti emerged during a period of rapid progress in text-to-image generation in 2022. It was published about a month after Imagen, another Google model announced in May 2022, and the two systems were explicitly framed as complementary explorations of different generative families: Parti is autoregressive, while Imagen is a diffusion model. The authors noted that combining the two approaches could open further research directions.[1][2] Earlier autoregressive text-to-image systems, including OpenAI's DALL-E and the CogView models, had established the broad template that Parti built on, but Parti pushed the autoregressive recipe to a substantially larger scale.[3]
The seventeen-author paper was led by Jiahui Yu, with Yonghui Wu among the senior authors. The team came out of Google's research organization, drawing on the same engineering lineage as Google's Pathways system and the broader work of Google Brain and Google Research. The "Pathways" in the model's name reflects that infrastructure heritage rather than a separately released product.[1][4]
The defining feature of Parti is that it casts image generation as a translation-like task. Where a machine-translation model reads a sentence in one language and emits a sentence in another, Parti reads a text prompt and emits a sequence of discrete image tokens, which are then decoded back into pixels. This framing lets the model inherit techniques and intuitions from large language models, particularly the expectation that quality scales with model and data size.[1][2][3]
The system has two main components. First, a Transformer-based image tokenizer called ViT-VQGAN learns to encode an image into a sequence of discrete tokens and to reconstruct those tokens back into a high-fidelity image. ViT-VQGAN serves as both the encoder that produces training targets and the detokenizer that turns generated token sequences back into visible pictures.[1][3] Second, a standard encoder-decoder Transformer is trained to map the text prompt (consumed by the encoder) to the corresponding image-token sequence, with the decoder predicting tokens one at a time in autoregressive fashion.[2][3]
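The decoding half of this two-stage design can be sketched as a simple sampling loop. This is a minimal illustration, not Parti's implementation: `toy_decoder_logits` is a hypothetical stand-in for the encoder-decoder Transformer, and `VOCAB_SIZE` and `SEQ_LEN` are toy values chosen for readability rather than the model's real codebook size or sequence length. In the real system, the resulting token sequence would be handed to the ViT-VQGAN detokenizer to produce pixels.

```python
import math
import random

VOCAB_SIZE = 16  # toy codebook size (assumption, not Parti's real value)
SEQ_LEN = 8      # toy number of image tokens per image (assumption)


def toy_decoder_logits(text_tokens, image_prefix):
    """Hypothetical stand-in for the Transformer decoder.

    Returns one score per codebook entry, conditioned on the text tokens
    and on the image tokens generated so far.
    """
    rng = random.Random(repr((tuple(text_tokens), tuple(image_prefix))))
    return [rng.uniform(-1.0, 1.0) for _ in range(VOCAB_SIZE)]


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate_image_tokens(text_tokens, rng):
    """Sample image tokens one at a time, each conditioned on the prefix."""
    image_tokens = []
    for _ in range(SEQ_LEN):
        probs = softmax(toy_decoder_logits(text_tokens, image_tokens))
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        image_tokens.append(next_token)
    return image_tokens


# Placeholder text-token ids standing in for an encoded prompt.
tokens = generate_image_tokens([3, 1, 4], random.Random(0))
print(tokens)
```

The loop makes the sequential dependency explicit: every new image token is drawn from a distribution that depends on the prompt and on all previously generated tokens.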
Because the image is represented as a flat sequence of tokens, the generation problem reduces to next-token prediction, the same objective used to train language models. That design choice is what allows Parti to benefit directly from the scaling behavior observed in large language modeling, and it is the central methodological difference between Parti and diffusion-based competitors such as Imagen and later systems like Imagen 2.[1][2]
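Written out, next-token prediction over image tokens takes the standard autoregressive form below. The notation is introduced here for illustration and is not taken from the paper: x is the text prompt, y = (y_1, ..., y_T) is the image-token sequence produced by the tokenizer, and θ denotes the Transformer's parameters.

```latex
p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta\left(y_t \mid y_{<t},\, x\right),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
```

Minimizing the loss on the right is ordinary cross-entropy training, which is why the same training recipe used for text models applies once images have been tokenized.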
Parti was studied at four sizes so the authors could measure how performance changes with scale. The encoder-decoder Transformer was trained at roughly 350 million, 750 million, 3 billion, and 20 billion parameters.[1][5]
| Model variant | Parameters |
|---|---|
| Parti-350M | 350 million |
| Parti-750M | 750 million |
| Parti-3B | 3 billion |
| Parti-20B | 20 billion |
The headline finding was that quality improved consistently as the model grew, with no sign of saturation at the largest size tested.[1][2] The 20-billion-parameter model set a new state-of-the-art zero-shot Fréchet Inception Distance (FID, where lower is better) of 7.23 on the MS-COCO benchmark, and a finetuned FID of 3.22.[1][2][3] Human raters also preferred the largest model by wide margins over the 3-billion-parameter version: in side-by-side comparisons, the 20B outputs were favored 63.2 percent of the time for image realism and quality, and 75.9 percent of the time for how well the image matched the text.[1] The gains were most pronounced on prompts that demanded abstract reasoning, specific viewpoints, world knowledge, or the rendering of written text, suggesting that scale helps most on the hardest compositional cases.[1][2]
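For readers unfamiliar with the metric, the Fréchet Inception Distance is conventionally defined as follows. The symbols are notation introduced here: μ_r and Σ_r are the mean and covariance of Inception-network features computed over real images, and μ_g and Σ_g are the same statistics over generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero when the two feature distributions have identical means and covariances, so lower values indicate generated images whose statistics sit closer to those of real images.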
A recurring weakness of early text-to-image models was their inability to spell. Parti's scaling study showed that the largest model handled writing and symbol rendering noticeably better than smaller variants, producing more legible words and signs inside generated scenes.[1][2] More broadly, the authors emphasized content-rich synthesis: the ability to compose images that combine multiple objects in specified relationships, reflect particular artistic styles, respect requested perspectives, and incorporate factual world knowledge.[1][3]
The Parti project page illustrated these capabilities with progressively harder prompts, showing how a single detailed description could be rendered with increasing fidelity as model size grew. The authors framed prompt complexity itself as a useful axis of evaluation, since a model's failures become more informative as prompts move from simple object requests toward elaborate, multi-clause instructions.[1][2]
Alongside the model, the team released PartiPrompts (often abbreviated P2), a benchmark of more than 1,600 English prompts intended to probe text-to-image systems across a wide range of difficulty.[1][2][3] Rather than scoring only photorealism, P2 is organized along two independent axes. Each prompt is tagged with one of 12 broad content categories (covering subjects such as abstract concepts, animals, vehicles, outdoor scenes, and world knowledge) and one of 11 challenge dimensions (including basic, quantity, writing and symbols, linguistic structures, imagination, fine-grained detail, and complex).[6]
This two-way tagging lets researchers analyze where a model struggles. A prompt like "7 dogs sitting around a poker table, two of which are turning away" falls under the animals category but tests the quantity challenge, while a request for an imaginary landscape might sit in outdoor scenes under the imagination challenge.[6] By separating subject matter from the kind of reasoning a prompt demands, PartiPrompts became a widely cited resource for evaluating later text-to-image models well beyond Parti itself.[6]
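The value of two independent tags can be sketched with a small tally. Everything here is illustrative: `TaggedPrompt` and `failure_rate_by` are hypothetical names, only the first prompt comes from the text above, the other two prompts are invented, and the `failed` set is a made-up judgement used solely to show the mechanics, not a reported result.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedPrompt:
    text: str
    category: str   # one of the 12 content categories
    challenge: str  # one of the 11 challenge dimensions


PROMPTS = [
    # Example quoted in the article.
    TaggedPrompt(
        "7 dogs sitting around a poker table, two of which are turning away",
        "Animals",
        "Quantity",
    ),
    # Invented prompts, for illustration only.
    TaggedPrompt("a floating island with waterfalls pouring into clouds",
                 "Outdoor Scenes", "Imagination"),
    TaggedPrompt("a dog", "Animals", "Basic"),
]


def failure_rate_by(prompts, failed, key):
    """Fraction of failed prompts per tag value along one axis.

    `failed` is a set of prompt texts judged to be failures;
    `key` is either "category" or "challenge".
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for prompt in prompts:
        tag = getattr(prompt, key)
        totals[tag] += 1
        if prompt.text in failed:
            failures[tag] += 1
    return {tag: failures[tag] / totals[tag] for tag in totals}


# Hypothetical judgement: suppose only the counting prompt failed.
failed = {PROMPTS[0].text}
by_category = failure_rate_by(PROMPTS, failed, "category")
by_challenge = failure_rate_by(PROMPTS, failed, "challenge")
print(by_category)
print(by_challenge)
```

Under this made-up judgement, the category view would show a mixed result for animal prompts, while the challenge view would isolate the failure to counting. That difference is the kind of diagnosis the two-axis design is meant to support.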
Parti is significant as a high-water mark for the autoregressive line of text-to-image research and as a clean demonstration that the scaling behavior familiar from language modeling carries over to image generation. Together with Imagen, it showed Google pursuing both major generative paradigms in parallel, and the FID results placed autoregressive generation on competitive footing with diffusion at the time.[1][2] The reported numbers, including the 7.23 zero-shot FID and the 20-billion-parameter scale, became reference points in subsequent text-to-image literature.[3]
Equally influential was the decision not to ship the model. Citing concerns about bias, safety, and the potential for misuse, the authors chose not to release Parti's weights, code, or training data publicly without additional safeguards, while still releasing the PartiPrompts benchmark to support further research.[1][3] That posture mirrored a broader caution among large research labs around generative imagery in 2022. Within Google, generative-image and multimodal work would continue across the organization and into Google DeepMind after its 2023 formation, with PartiPrompts persisting as a standard evaluation set even as the underlying generative approach in many later products shifted toward diffusion.[6]